An Extremely Data-efficient and Generative LLM-based Reinforcement Learning Agent for Recommenders

Shuang Feng∗
fengshuang@gmail.com
Stanford University SCPD
Palo Alto, California, USA

Grace Feng†
gracefeng@ucsb.edu
University of California Santa Barbara
Santa Barbara, California, USA

Abstract
Recent advancements in large language models (LLMs) have enabled understanding of webpage contexts, product details, and human instructions. Utilizing LLMs as the foundational architecture for either reward models or policies in reinforcement learning has gained popularity; a notable achievement is the success of InstructGPT [11]. RL algorithms have been instrumental in maximizing long-term customer satisfaction and avoiding short-term, myopic goals in industrial recommender systems, which often rely on deep learning models to predict immediate clicks or purchases.

In this project, several RL methods are implemented and evaluated using the WebShop [17] benchmark environment, data, simulator, and pre-trained model checkpoints. The goal is to train an RL agent to maximize the purchase reward given a detailed human instruction describing a desired product.

The RL agents are developed by fine-tuning a pre-trained BERT model with various objectives, learning from preferences without a reward model, and employing contemporary training techniques such as Proximal Policy Optimization (PPO), as used in InstructGPT [11], and Direct Preference Optimization (DPO) [12]. This report also evaluates RL agents trained using generative trajectories. Evaluations were conducted using Thompson sampling in the WebShop simulator environment.

The simulated online experiments demonstrate that DPO outperforms PPO in data efficiency and task performance, especially in success rate, given the same amount of training time; however, longer training time is necessary for a fair comparison between the two. Specifically, without utilizing any image data, a DPO agent achieved a 19% success rate after approximately 3000 steps, or 30 minutes of training on T4 GPUs, compared to a PPO agent, which reached a 15% success rate after 2 hours of training. Results also indicate that agents trained on generated trajectories exhibited task performance comparable to those trained using human trajectories. This demonstrates an extremely low-cost, data-efficient way of training reinforcement learning agents.

Keywords
LLM, Reinforcement Learning, Recommender, Contrastive Learning, Generative AI, RLHF, Human Preference, E-commerce

∗Shuang Feng is the corresponding author and the technical contributor of the paper.
†Grace Feng assisted with data processing, running experiments, and editing the paper.
1 This paper was originally part of a class project for CS234, Spring 2024, at Stanford University and was submitted to the KDD'24 RelKD Workshop on 6/30/2024. It was accepted in July 2024. See https://github.com/fengshuang-coding/KDD2024 for updates.

1 Introduction
Recent advances in Large Language Models (LLMs) have significantly enhanced research and applications in understanding human instructions on the web, processing webpage text, and grasping context. These advancements have provided valuable tools for training reinforcement learning (RL) agents to navigate web environments, particularly in e-commerce and various recommender systems such as YouTube and Netflix. Leveraging LLMs in RL agent training is relatively new but has proven successful.
A notable example is InstructGPT [11], where an RL agent was trained using human preferences by fine-tuning GPT-3 models with human instructions. Combining LLMs with RL techniques enables the creation of intelligent web agents that can understand human instructions and complete tasks in web or app environments, thereby maximizing desired rewards.

Recommender systems have evolved from collaborative filtering [6] to the recent surge in deep supervised learning, which predicts immediate user responses such as clicks [4, 18]. This approach has seen tremendous success in personalized user engagement. However, after several years in production, deep supervised learning algorithms have shown limitations, including: 1) a focus on optimizing short-term gains at the expense of long-term user satisfaction and retention, and 2) strong feedback loops caused by training data generated from these algorithms, which exacerbate these effects. Conversely, RL algorithms are designed to optimize long-term gains by learning policies that maximize long-term user satisfaction. RL agents are also well known for their ability to perform sequential planning and make decisions based on Markov Decision Process (MDP) properties [2].

The training of RL agents for recommenders in web environments has been actively studied, with several benchmark datasets and trained agents available. For example, WikiNav [10] provides a benchmark for web-based navigation RL agents. RecoGym [13] offers a benchmark for RL agents in production recommendations for online advertising. Virtual-Taobao [15] includes a virtual online shopping environment derived from Taobao, hosting several RL algorithms for product recommendations. WebShop [17] presents a simulated e-commerce web environment with over 1,600 human demonstrations for web shopping tasks based on human text instructions. This environment includes 1.18 million products with text and image descriptions, along with 12,087 crowd-sourced text instructions. The authors of WebShop also explored several imitation and RL agents trained using real-world human trajectories.

Previous explorations in RL for web-based recommenders are extensive. Query reformulation, as published in [10], is part of an RL problem aimed at optimizing outcomes. In this context, search engines are considered black boxes, and the RL agent (or reformulator) learns to generate queries that maximize the expected return through actions in the state space. This paper, published in 2017, predates the widespread use of BERT [5]. The authors proposed a PRF framework, with a CNN/RNN serving as the contextual learner and query generator. A more recent work proposed the concept of "learning to search" [1], where a search agent mimics the interactive process by generating interactive search queries based on previous queries and saving the best queries along the way. The authors used a fine-tuned T5 model as a query generator to interact with the search engine iteratively, producing a set of fine-grained queries that yield better outcomes. Another related work, WebGPT [9], utilizes a web interface and a search engine to train RL agents to answer questions.

2 Related Work
The work presented in this paper is built and evaluated within the WebShop [17] environment, a simulator of an online web-shopping recommender system.
2.1 The WebShop Environment
WebShop is a benchmark project designed to train reinforcement learning algorithms in a large-scale, interactive, web-based environment. It includes over 12,000 crowdsourced human instructions and over 1.1 million products scraped from amazon.com. A total of 670 attributes were derived from concatenated product titles and descriptions using bi-gram representations and assigned to each product through TF-IDF scoring.

Figure 1 and Figure 2 provide an example of the WebShop interface and a sequence of actions.

Figure 1: WebShop Environment [17]

The original paper decomposes the problem into two sets of reinforcement learning models: one for search and one for choice (clicks). The search model is an imitation learning model (search-IL) that mimics human search queries given instructions; at its core, it is a BART [7] model fine-tuned on human instruction and query pairs. For choice (click) learning, the authors present several reinforcement learning models that optimize the sequence of clicks while navigating the recommender simulator so as to maximize the end reward (purchase). The reward is calculated by a scoring function that quantifies the relevance between the purchased product and the human instruction, based on the attributes of the product. The imitation learning algorithm for choice (choice-IL) presented by the original authors is, at its core, a BERT [5] model fine-tuned on human trajectories. The reinforcement learning algorithm for choice (choice-IL-RL) starts from the fine-tuned imitation learning BERT model as the baseline and iterates the optimization using a mixed objective of policy gradients and cross entropy.

Figure 2: WebShop Human Instructions and Human Trajectories [17]

The state space in this problem consists of abstractions of four types of webpages: the search page, the product recommendation page, the product page, and the product detail page. The search page features only a search bar, which takes a search query generated either by a human or by a search agent and serves as the input to the search engine. Actions include searching, clicking buttons, and choosing from a drop-down menu. Clicking the purchase button marks the end of a trajectory. State transitions are initiated by clicks and other actions that deterministically redirect from one webpage (state) to another. Observations, which include the state and the instruction at a specific time snapshot, together form the input from which the reinforcement learning agent makes subsequent actions.

The search engine used by the WebShop project is self-built and self-indexed offline using Pyserini [8], which is built upon the open-source Lucene search library. Product retrieval is based on BM25 matching between search queries and product information text. The top 50 results are shown across 5 pages, ranked by BM25.

2.2 Reinforcement Learning with Human Preference - RLHF
RLHF [3], together with PPO [14], was successfully used to train several well-known GPT-related products, such as InstructGPT [11]. RLHF leverages the Bradley-Terry model, which defines the preference probability using the rewards of the preferred and dispreferred responses labeled by human labelers:

$$P(y_w \succ y_l \mid x) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))}.$$

The RLHF reward-model objective can then be defined as a cross-entropy (negative log-likelihood) loss over these preference probabilities.
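For concreteness, the following is a minimal sketch of how the Bradley-Terry preference probability and the corresponding reward-model loss could be computed in PyTorch. It is an illustration only, not the InstructGPT or WebShop implementation; `reward_model` and the toy inputs are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, x, y_w, y_l):
    """Negative log-likelihood of the Bradley-Terry preference model.

    `reward_model(x, y)` is assumed to return one scalar reward per example;
    `y_w` holds preferred responses and `y_l` dispreferred ones.
    """
    r_w = reward_model(x, y_w)  # rewards of preferred responses
    r_l = reward_model(x, y_l)  # rewards of dispreferred responses
    # P(y_w > y_l | x) = sigmoid(r_w - r_l); the loss is -log of that probability.
    return -F.logsigmoid(r_w - r_l).mean()

if __name__ == "__main__":
    # Toy reward model: a linear layer over concatenated (x, y) feature vectors.
    torch.manual_seed(0)
    linear = torch.nn.Linear(8, 1)
    toy_reward = lambda x, y: linear(torch.cat([x, y], dim=-1)).squeeze(-1)
    x, y_w, y_l = torch.randn(4, 4), torch.randn(4, 4), torch.randn(4, 4)
    print(bradley_terry_loss(toy_reward, x, y_w, y_l).item())
```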
2.3 PPO for Regularized Policy Gradient
Proximal Policy Optimization (PPO) [14] has been demonstrated to be effective in fine-tuning GPT models with human instructions and labeled preferences [11]. PPO uses clipping or KL divergence constraints to minimize the likelihood of large updates between steps, approximately providing guarantees for monotonic improvement. This approach converges in probability to local optima and, in practice, results in more stable training outcomes. The clipped loss function for policy gradient in PPO can be expressed as:

$$L^{PPO}_{\theta_k} = -\mathbb{E}_{\tau \sim \pi_k}\!\left[\min\!\left(z_t(\theta)\,\hat{A}^{\pi_k}_t,\ \mathrm{clip}\!\left(z_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}^{\pi_k}_t\right)\right], \tag{1}$$

where
$$z_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_k}(a_t \mid o_t)}, \qquad \hat{A}^{\pi_k}_t = R_t - V^{\pi_k}(o_t).$$

2.4 Learning with Human Preference - DPO
The development of Direct Preference Optimization (DPO) [12] is revolutionary. It eliminates the need for explicit reward functions for preferences and instead relies solely on paired preference trajectories as training data. DPO is derived by combining the Bradley-Terry objective

$$L_{BT}(r, D) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\!\left[\log \sigma\!\left(r(x, y_w) - r(x, y_l)\right)\right]$$

with the RLHF objective:

$$\max_{\pi} \ \mathbb{E}_{x \sim D,\, y \sim \pi}\!\left[r(x, y)\right] - \beta\, \mathbb{D}_{KL}\!\left[\pi(y \mid x)\,\|\,\pi_{ref}(y \mid x)\right].$$

The DPO loss function is:

$$L_{DPO}(\pi_\theta, \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right],$$

where $\pi_\theta$ is the DPO policy to learn and $\pi_{ref}$ is the pre-selected reference policy. As the form of the loss function shows, although DPO does not require an explicitly trained reward model, it does require a pre-defined reference policy to iterate upon.

3 Approach
This report summarizes several efforts implementing and evaluating PPO against contrastive learning with DPO, using human trajectories paired with generated unpreferred trajectories. It then demonstrates the contrastive learning effort using fully generative trajectories.

We branched and implemented DPO and PPO on top of the original WebShop code package, together with a new generative module for the self-generative learning experiments and a Thompson sampling module to roll out online experiments and collect results. For PPO training, the policy gradient objective from the original paper [17] is modified into the PPO objective shown in equation (1). The overall objective, which is the total loss from the policy gradient (PG), entropy, and imitation learning components, remains the same as in the original paper, except that the PG component is replaced with the PPO loss.

3.1 Semi-generative Reinforcement Learning Using Human Trajectories
For this project, we utilize a pre-trained imitation learning agent checkpoint as the reference policy to generate unpreferred trajectories. Preferred trajectories are obtained from the human data provided by the WebShop benchmark. During training, a human trajectory is randomly sampled, including the states and available actions from the log. At each state where an action decision is needed, an unpreferred action is generated using the reference policy. This unpreferred action is paired with the preferred action taken by the human. The DPO update is applied after each episode based on the human trajectory; a sketch of this update step is shown at the end of this subsection.

This approach is considered both generative and semi-self-learning. It is generative because we use a predefined unpreferred policy to generate actions for pair-wise training. It is semi-self-learning because it pairs these generated actions with previously collected human trajectories, which serve as the gold standard.
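The sketch below illustrates how the DPO loss from Section 2.4 could be applied to this pairing scheme: at every state of a sampled human trajectory, the human action is treated as the preferred choice and an action sampled from the frozen reference (imitation learning) policy as the dispreferred one. This is a simplified sketch under assumed interfaces; `policy_logprob` and `sample_action` are hypothetical helpers that do not correspond to functions in the WebShop codebase, and `beta` is an assumed DPO hyperparameter.

```python
import torch
import torch.nn.functional as F

def dpo_update_on_trajectory(policy, ref_policy, optimizer, trajectory,
                             policy_logprob, sample_action, beta=0.1):
    """One DPO update over a single human trajectory.

    `trajectory` is assumed to be a list of (observation, human_action,
    available_actions) tuples taken from the human logs;
    `policy_logprob(model, obs, action)` is an assumed helper returning
    log pi(action | obs) under a BERT-based choice policy, and
    `sample_action(model, obs, available_actions)` samples one action.
    """
    losses = []
    for obs, human_action, available_actions in trajectory:
        with torch.no_grad():
            # Dispreferred action: generated by the frozen reference policy.
            ref_action = sample_action(ref_policy, obs, available_actions)
        if ref_action == human_action:
            continue  # the pair carries no preference signal at this state

        # Log-probabilities under the policy being learned...
        logp_w = policy_logprob(policy, obs, human_action)
        logp_l = policy_logprob(policy, obs, ref_action)
        # ...and under the frozen reference policy.
        with torch.no_grad():
            ref_logp_w = policy_logprob(ref_policy, obs, human_action)
            ref_logp_l = policy_logprob(ref_policy, obs, ref_action)

        # L_DPO = -log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        losses.append(-F.logsigmoid(margin))

    if losses:  # apply the update once per episode
        loss = torch.stack(losses).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```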
3.2 Self-learning - Training with Generated Trajectories
In classic reinforcement learning, self-play or learning through simulation plays a crucial role, particularly when data collection is costly, as is the collection of human trajectories in this problem. Self-play has proven to be effective, with the most notable example being AlphaGo [16].

To evaluate the idea of self-play or self-learning for navigating the WebShop recommendation system, we generated 100 preferred trajectories using a straightforward method: sampling trajectories with perfect reward (score = 1). This sampling was done using the imitation learning agent checkpoint provided by the authors of the WebShop paper, but with real-world human instructions. Ideally, these sampled trajectories are pruned to eliminate looped sub-trajectories. A DPO agent is then trained from the same checkpoint used for the DPO evaluation in the previous section, for 3000 steps. Task performance between these two DPO agents, one trained using semi-learning with human trajectories and the other using self-learning with generated trajectories, is compared using Thompson sampling run in the WebShop simulator environment.

4 Experimental Results
4.1 DPO vs. PPO Task Performance
In this project, leveraging the WebShop environment and simulator, we conduct extensive simulated online experiments using Thompson sampling to analyze the performance differences across the trained agents. The goal of Thompson sampling is to select the optimal action (or "arm") that minimizes overall regret. However, when sampling over a small number of steps, it may not be ideal for estimating rewards from arms that are perceived as less optimal, due to insufficient exploration. To address this, we use multiple parallel runs of Thompson sampling, each with 1000 rollouts, to capture variability across runs. Careful experimental design and carefully calculated rollouts of online experiments are necessary for accurately estimating the rewards and success rates of each agent. The aim of this project is to implement and understand the performance trends across different approaches.
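To make the evaluation procedure concrete, the sketch below shows one way a Beta-Bernoulli Thompson sampler over trained agents (treated as arms) could be run in a simulator and repeated across parallel runs of 1000 rollouts each. It is a minimal sketch under assumptions: `run_episode(agent)` is a hypothetical function standing in for a WebShop rollout that returns the episode's success indicator, and is not part of the actual WebShop package.

```python
import random
from collections import defaultdict

def thompson_sampling_run(agents, run_episode, n_rollouts=1000, seed=0):
    """One Thompson sampling run with a Beta-Bernoulli posterior per agent (arm).

    `agents` maps a name to a trained agent; `run_episode(agent)` is assumed to
    roll out one episode and return 1 for a successful purchase, 0 otherwise.
    """
    rng = random.Random(seed)
    alpha = defaultdict(lambda: 1.0)  # Beta posterior: 1 + observed successes
    beta = defaultdict(lambda: 1.0)   # Beta posterior: 1 + observed failures
    successes, pulls = defaultdict(int), defaultdict(int)

    for _ in range(n_rollouts):
        # Sample a success-rate estimate for each arm and pull the best one.
        sampled = {name: rng.betavariate(alpha[name], beta[name]) for name in agents}
        chosen = max(sampled, key=sampled.get)
        reward = run_episode(agents[chosen])  # 1 if success, else 0
        alpha[chosen] += reward
        beta[chosen] += 1 - reward
        successes[chosen] += reward
        pulls[chosen] += 1

    # Empirical success rate and pull count per agent for this run.
    return {name: (successes[name] / pulls[name] if pulls[name] else None,
                   pulls[name]) for name in agents}

# Multiple parallel runs capture variability across runs, e.g.:
# results = [thompson_sampling_run({"dpo": dpo_agent, "ppo": ppo_agent},
#                                  run_episode, seed=s) for s in range(5)]
```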
The results indicate that Direct Preference Optimization (DPO) agents achieve significantly higher scores and success rates compared to Proximal Policy Optimization (PPO) agents, even though all agents start from the same imitation learning BERT model checkpoint provided by the original paper. It is important to note that all agents in this comparison are trained without image data, so the scores and success rates collected are not directly comparable to the original paper, which includes image data in training and experiments.

An interesting finding is that DPO agents trained using human trajectories perform similarly to DPO agents trained using generated trajectories, albeit with larger variance in success rate across runs for the human-trajectory agents. The smaller variance observed in the self-learning agents can be attributed to the fact that only 100 generated trajectories were used to train the DPO self-learning agent, compared to 1200 human trajectories used for training the DPO agent with human data. The fact that the DPO agents are trained for only 3000 steps also suggests that data inefficiency, or a data bottleneck, may be underestimated when training for longer periods of time on the same set of data. When training an agent for production systems, the limited number of available trajectories can result in decreased task performance due to insufficient information learned from the limited data. In reality, collecting human data is expensive and time-consuming. This issue can be mitigated by generating preferred and unpreferred trajectories to serve as a continuous, low-cost source of training data.

Figure 3: DPO vs. PPO — Human Trajectories and Generated Trajectories — Scores

It is important to note that the results of this project are not directly comparable to those of the original paper due to two key differences: 1) no image data were used for training or experiments for any of the agents evaluated in this project, and 2) each agent was trained with minimal steps (3000) and within a timeframe of less than one hour. The purpose of this project is not to benchmark results but to investigate variations in reinforcement learning algorithms.

Figure 4: DPO vs. PPO — Human Trajectories and Generated Trajectories — Success Rate

Training using fully generated preferences on top of the DPO agent achieved much higher scores than the DPO agents trained using human trajectories, while the success rate remained similar (Figure 5, Figure 6). The magnitude of this difference needs to be justified using variance across runs, but the finding demonstrates the potential of using generative data to enhance training on top of existing agents initially trained with human trajectories.

Figure 5: Self-learning Using Generated Trajectories — Scores

5 Conclusion
With very limited training time (<1 hour), Direct Preference Optimization (DPO) outperforms Proximal Policy Optimization (PPO), offering better task performance and higher success rates with less training time. However, more evaluations with longer training time are necessary to draw a firm conclusion. Using a DPO agent trained within one hour and without image data, we achieved a success rate of approximately 19%. This is higher than the success rate of an RL agent trained with an RNN network (without pre-trained models for search or choice imitation learning), which reached an 18% success rate in the original paper.

Figure 6: Self-learning Using Generated Trajectories — Success Rate

PPO is known to provide less volatile training and approximately monotonic improvement guarantees for RL objectives. By nature, PPO's regularization and clipping of objectives prevent rapid policy changes, making it suitable for problems with smaller state and action spaces where large policy changes are not expected. However, in the context of online product recommenders, where the state-action space can expand to millions of dimensions and rapid policy changes are essential for fast learning, PPO can require longer training time.

Training DPO agents with generated trajectories has shown great potential. With only 100 generated trajectories and the same amount of computational resources, the task performance was comparable to that of a DPO agent trained using 1200 human trajectories. This approach addresses data inefficiency and the high costs of human data collection. As training requires more time and data, the limited availability of human data can hinder continuous improvement. This exercise demonstrates that generated trajectories can be nearly as effective as human trajectories and can even serve as a continuous, low-cost source of training data. Additionally, generated trajectories allow exploration of successful paths not seen by humans, similar to the approach that contributed to the success of AlphaGo [16] and AlphaZero, which were trained using self-play rather than past human games.

6 Potential Usage: Using Trained Agents as a Recommender
Using reinforcement learning agents in recommenders is not new; they are known to be used in online recommenders such as YouTube. The trained optimal policy can become an ideal ranking algorithm for recommender systems. Starting from a human instruction, the agent simulates navigating through a provided list of products following the trained optimal policy and yields a "purchased" product from each run. When used in a recommender, multiple runs of the agent provide a list of recommended products for the user, and the order in which they are presented in a user interface, such as on the web or in an app, can be determined by rank-ordering the recommended products by the score or success observed for each product across runs of the RL agent, as sketched below.
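As an illustration of this usage pattern, the sketch below runs a trained agent several times for one instruction and rank-orders the resulting "purchased" products by how often they are selected and by their average score. It is a hedged sketch: `run_agent_episode(agent, instruction)` is a hypothetical rollout function returning the purchased product id and its score, not an API of the WebShop package.

```python
from collections import defaultdict

def recommend_products(agent, instruction, run_agent_episode, n_runs=20):
    """Turn repeated rollouts of a trained RL agent into a ranked product list.

    `run_agent_episode(agent, instruction)` is assumed to simulate one full
    trajectory (search, clicks, purchase) and return (product_id, score).
    """
    counts = defaultdict(int)
    total_score = defaultdict(float)
    for _ in range(n_runs):
        product_id, score = run_agent_episode(agent, instruction)
        counts[product_id] += 1
        total_score[product_id] += score

    # Rank by how often a product was "purchased", then by its average score.
    ranked = sorted(
        counts,
        key=lambda pid: (counts[pid], total_score[pid] / counts[pid]),
        reverse=True,
    )
    return [(pid, counts[pid], total_score[pid] / counts[pid]) for pid in ranked]
```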
Acknowledgments
To the WebShop authors Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan from Princeton University, who published [17], which inspired this project report. To Professor Chris Potts, who introduced WebShop [17] to the author of this report, and for his excellent teaching of CS224U at Stanford University. To Professor Emma Brunskill for her excellent teaching of CS234 at Stanford University.

References
[1] Leonard Adolphs, Benjamin Börschinger, Christian Buck, Michelle Chen Huebscher, Massimiliano Ciaramita, Lasse Espeholt, Thomas Hofmann, and Yannic Kilcher. 2021. Boosting Search Engines with Interactive Agents. CoRR abs/2109.00527 (2021). arXiv:2109.00527 https://arxiv.org/abs/2109.00527
[2] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. 2018. Top-K Off-Policy Correction for a REINFORCE Recommender System. CoRR abs/1812.02353 (2018). arXiv:1812.02353 http://arxiv.org/abs/1812.02353
[3] Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2023. Deep reinforcement learning from human preferences. arXiv:1706.03741 [stat.ML]
[4] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys '16). Association for Computing Machinery, New York, NY, USA, 191–198. https://doi.org/10.1145/2959100.2959190
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2019). arXiv:1810.04805 [cs.CL]
[6] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37. https://doi.org/10.1109/MC.2009.263
[7] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703
[8] Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Frassetto Nogueira. 2021. Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations. CoRR abs/2102.10073 (2021). arXiv:2102.10073 https://arxiv.org/abs/2102.10073
[9] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. WebGPT: Browser-assisted question-answering with human feedback. CoRR abs/2112.09332 (2021). arXiv:2112.09332 https://arxiv.org/abs/2112.09332
[10] Rodrigo Frassetto Nogueira and Kyunghyun Cho. 2017. Task-Oriented Query Reformulation with Reinforcement Learning. CoRR abs/1704.04572 (2017). arXiv:1704.04572 http://arxiv.org/abs/1704.04572
[11] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 27730–27744. https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
[12] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. (2023). arXiv:2305.18290 [cs.LG]
[13] David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. 2018. RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. CoRR abs/1808.00720 (2018). arXiv:1808.00720 http://arxiv.org/abs/1808.00720
[14] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 http://arxiv.org/abs/1707.06347
[15] Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. 2019. Virtual-Taobao: Virtualizing Real-World Online Retail Environment for Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence 33, 01 (Jul. 2019), 4902–4909. https://doi.org/10.1609/aaai.v33i01.33014902
[16] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (2016), 484–503. http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
[17] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2023. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. (2023). arXiv:2207.01206 [cs.CL]
[18] Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. CoRR abs/1707.07435 (2017). arXiv:1707.07435 http://arxiv.org/abs/1707.07435